
Loading Data

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

About The Data Set

I am using the Red Wine Quality data set, provided by Udacity and created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009.

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

Exploring our data set

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Exploring our variables

Quality - The Main Variable of Our Investigation:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

All of the wines have a quality rating between 3 and 8, and most of them are rated 5 or 6.

We will look for relationships between quality and the other variables later in the investigation.

I will create a new variable called quality.cat to transform the quality variable from numerical to categorical.

quality.cat is a categorical variable with 3 levels:
* Low: quality ratings above 0 and up to 4
* Medium: quality ratings above 4 and up to 6
* High: quality ratings above 6 and up to 8
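A minimal sketch of how such a variable could be built with base R's `cut()`. It assumes right-closed bins (0, 4], (4, 6], (6, 8], which matches ratings 5 and 6 counting as Medium. The data frame name `wine` and the ten sample ratings (taken from the str() output above) are only illustrative; this is not necessarily the code used for the report.

```r
# Illustrative ratings only: the first ten quality values from the str() output
wine <- data.frame(quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# Assumption: bins are right-closed, so 3-4 -> Low, 5-6 -> Medium, 7-8 -> High
wine$quality.cat <- cut(wine$quality,
                        breaks = c(0, 4, 6, 8),
                        labels = c("Low", "Medium", "High"))

# Count how many wines fall in each category
table(wine$quality.cat)
```

Because `cut()` closes intervals on the right by default, a rating of exactly 4 lands in Low and a rating of exactly 6 lands in Medium.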

Distribution of the rest of the variables

Some variables (Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Alcohol, and Sulphates) show skewed shapes rather than a normal distribution, or have a very long tail, so I will transform them using log10 to try to get closer to normal distributions.
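To illustrate why log10 helps with a long right tail, here is a small sketch using only the first ten residual.sugar values printed in the str() output. The `skewness` helper is a hypothetical, moment-based function written for this example; it is not part of the original analysis, and ten values are far too few to describe the full data set.

```r
# First ten residual.sugar values, as printed in the str() output
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness (hypothetical helper for illustration)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))

# The long right tail is pulled in, so the skewness is smaller after the transform
c(raw = skew_raw, log10 = skew_log)
```

A positive skewness means a tail to the right; a value closer to 0 means a more symmetric shape.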

Univariate Analysis

What is the structure of your dataset?

1,599 observations and 12 variables (13 after adding quality.cat). A few of these variables are normally distributed (Fixed Acidity, Volatile Acidity, Density, and pH); the rest are skewed and/or have a very long tail.

## [1] 1599   13
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "quality.cat"

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature, as I am interested in investigating which chemical properties influence the quality of red wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All of the other variables will help my investigation. I will look at the relationship between each variable and quality to figure out which of these features influence it.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable, quality.cat, that converts the numeric quality variable into a categorical one.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Many of the variables are skewed with a very long tail rather than normally distributed, so I used log10 to transform them and get closer to normal distributions.

Bivariate Plots Section

Pairing Variables

Looking at the plots, I can see some relationship between alcohol and quality. I will also investigate whether there are relationships between quality and any of these properties:
* Sulphates
* Citric Acid
* Volatile Acidity

Do wines with higher alcoholic content receive better ratings?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] 0.4761663

As we can see, there is a positive relationship, but with a correlation of about 0.48 it is only moderate. I will use the categorical quality variable with boxplots for a better understanding.

As we can see, most of our data fall under medium quality. Still, yes, there is a relationship between high quality and higher alcohol content.
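The two steps used here (a correlation coefficient, then a comparison by quality category) can be sketched as follows. The sketch uses only the first ten alcohol and quality values from the str() output, so it will not reproduce the 0.476 reported for the full data set. The name `wine` and the right-closed bins for quality.cat are assumptions.

```r
# First ten rows of alcohol and quality, as printed in the str() output
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation (the default method of cor())
r <- cor(wine$alcohol, wine$quality)

# Assumed right-closed bins for the categorical quality variable
wine$quality.cat <- cut(wine$quality, breaks = c(0, 4, 6, 8),
                        labels = c("Low", "Medium", "High"))

# Median alcohol per category: the line drawn inside each box of a boxplot
# (categories with no wines in this small sample come back as NA)
med <- tapply(wine$alcohol, wine$quality.cat, median)
med
```

Comparing medians per category is a numeric counterpart to reading the boxplots: if the median rises from Low to High, the boxes shift upward in the same way.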

Do wines with higher Sulphates content receive better ratings?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## [1] 0.2513971

There is a relationship between high quality and high sulphates, though it is not as strong as the one with alcohol. Let’s check with the categorical quality variable.

Yes, the relationship is positive and it is clear in this plot.

Do wines with higher Citric Acid receive better ratings?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] 0.2263725

So far, the relationship is not clear. With the categorical quality variable the relationship becomes clear, so we can say yes, there is a positive relationship between high quality ratings and Citric Acid.

Do wines with high Volatile Acidity receive better ratings?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] -0.3905578

Oh! This time we are looking at a negative relationship, so let’s see the boxplot. And yes, there is a fairly strong negative relationship between quality rating and Volatile Acidity.

Relationships Between Supporting Variables

Now, let’s see whether there are relationships among the other variables that have no relationship with the quality variable. Let’s have a look at the relationship between Density and pH. It is an obvious negative relationship.

What about Density and Fixed Acidity? A very strong positive relationship here.

So there should be a negative relationship between pH and Fixed Acidity. And yes! As expected, a very strong negative relationship between pH and Fixed Acidity.
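The three pairwise relationships above can also be checked in one step with a correlation matrix. This is only a sketch on the first six rows from the head() output, with `wine` as an assumed data frame name; six rows are far too few to reproduce the strength of the relationships seen in the full data set.

```r
# First six rows of the three variables, as printed in the head() output
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pairwise Pearson correlations; the matrix is symmetric with 1s on the diagonal
cor_mat <- cor(wine)
round(cor_mat, 2)
```

Each off-diagonal cell is the correlation for one pair of variables, so the sign of every pair can be read from a single table instead of three separate plots.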

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found that the strongest relationship between our main feature (Quality) and the other features is the one between Quality and Alcohol, and it is positive. There are also positive relationships between Quality and Sulphates, and between Quality and Citric Acid. A negative relationship was found between Quality and Volatile Acidity.

Did you observe any interesting relationships between the other features? What was the strongest relationship you found?

I found a very strong negative relationship between Density and pH, and between Fixed Acidity and pH, while there is a positive relationship between Density and Fixed Acidity.

Multivariate Plots Section

In this section I will create plots of two variables, one of which has a strong relationship with the main variable (quality), and I will use the categorical quality variable (quality.cat) for the levels. This combination may lead to a better understanding of what affects red wine quality ratings.

Alcohol With Other Properties:

Alcohol with Density:

There are far fewer red dots (low quality) at low density and high alcohol than at high density and low alcohol. Thus, we can say that low density combined with high alcohol leads to a medium or high quality rating.

Alcohol with Free Sulfur Dioxide:

Free sulfur dioxide and alcohol have no clear relationship with each other. However, higher free sulfur dioxide has few red dots (low quality), and higher alcohol has few or no red dots as well, so there appear to be 3 combinations that lead to a medium or high quality rating:
1. High alcohol and low free sulfur dioxide
2. High free sulfur dioxide and low alcohol
3. High alcohol and high free sulfur dioxide

Citric Acid With Other Properties:

Citric Acid with pH:

pH and citric acid have a negative relationship, as do pH and quality, while citric acid has a positive relationship with quality. So, low pH and high citric acid give a high chance of a high quality rating.

Citric Acid with Fixed Acidity:

Fixed Acidity and Citric Acid have a strong positive relationship with each other, and because Citric Acid has a positive relationship with quality, high quality dots are much more common where both citric acid and fixed acidity are high. But we still have low quality dots there as well!

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There are far fewer red dots (low quality) at low density and high alcohol than at high density and low alcohol. Thus, we can say that low density combined with high alcohol leads to a medium or high quality rating.

pH and citric acid have a negative relationship, as do pH and quality, while citric acid has a positive relationship with quality. So, low pH and high citric acid give a high chance of a high quality rating.

Fixed Acidity and Citric Acid have a strong positive relationship with each other, and because Citric Acid has a positive relationship with quality, high quality dots are much more common where both citric acid and fixed acidity are high. But we still have low quality dots there as well!

Were there any interesting or surprising interactions between features?

Free sulfur dioxide and alcohol have no clear relationship with each other. However, higher free sulfur dioxide has few red dots (low quality), and higher alcohol has few or no red dots as well, so there appear to be 3 combinations that lead to a medium or high quality rating:
1. High alcohol and low free sulfur dioxide
2. High free sulfur dioxide and low alcohol
3. High alcohol and high free sulfur dioxide


Final Plots and Summary

Plot One

Description One

This is the plot of the main variable of this investigation, the quality rating for red wines. As you can see, most of our data fall under 5 and 6, which means medium quality. Thus, our investigation/prediction for high or low quality ratings will not be very accurate, as we don’t have enough data in either group.

Plot Two

Description Two

Fixed Acidity and Citric Acid have a strong positive relationship with each other, and because Citric Acid has a positive relationship with quality, high quality dots are much more common where both citric acid and fixed acidity are high. But we still have low quality dots there as well!

Plot Three

Description Three

Free sulfur dioxide and alcohol have no clear relationship with each other. However, higher free sulfur dioxide has few red dots (low quality), and higher alcohol has few or no red dots as well, so there appear to be 3 combinations that lead to a medium or high quality rating:
1. High alcohol and low free sulfur dioxide
2. High free sulfur dioxide and low alcohol
3. High alcohol and high free sulfur dioxide

Reflection

Reflection on the project:

We have seen some relationships between the main variable and the rest of the variables. However, our investigation and findings are not very accurate, as we don’t have enough data under the low and high quality ratings; most of our data fall under the medium quality rating, which makes the investigation hard. We could also compare this data set (Red Wine Quality) with data sets for other kinds of wine, such as White Wine Quality, or we could combine both data sets to get more observations under the low and high quality ratings and improve our predictions.

Reflection on Learning R:

I found learning R much harder than learning Python or SQL. I could do the same project in Python and get almost the same results in half the time I spent on it in R. However, R has more features for plotting than Python (from what I have learned so far), and creating plots in R is fun and colorful. I would love to learn more about using R in data analysis, but I will definitely be using Python and SQL more.